† Corresponding author. E-mail:
Link prediction aims at detecting missing, spurious or evolving links in a network, based on the topological information and/or nodes’ attributes of the network. Under the assumption that the likelihood of the existence of a link between two nodes can be captured by nodes’ similarity, several methods have been proposed to compute similarity directly or indirectly, with information on node degree. However, correctly predicting links is also crucial in revealing the link formation mechanisms and thus in providing more accurate modeling for networks. We here propose a novel method to predict links by incorporating stochastic-block-model link generating mechanisms with node degree. The proposed method first recovers the underlying block structure of a network by modularity-based belief propagation, and based on the recovered block structural information it models the link likelihood between two nodes to match the degree sequence of the network. Experiments on a set of real-world networks and synthetic networks generated by stochastic block model show that our proposed method is effective in detecting missing, spurious or evolving links of networks that can be well modeled by a stochastic block model. This approach efficiently complements the toolbox for complex network analysis, offering a novel tool to model links in stochastic block model networks that are fundamental in the modeling of real world complex networks.
Networks are powerful tools to describe numerous complex social, biological, and information systems, where nodes are units of the system and links between nodes indicate their interactions or relations.[1] For example, an urban bus traffic system can be modeled by different types of networks for further analysis.[2] The analysis of the functions and dynamics of complex systems can then be conducted through network analysis, from network structure to network dynamics.[3–6] Due to the limited knowledge of a complex real system represented by a network, data collected are always partial or blended with noise. For example, 80% of the molecular interactions in yeast[7] and more than 99% in humans[8,9] are still unknown. Similarly, informant inaccuracy[10] or sampling biases[11] in social sciences often lead to social networks constructed with missing data or noise. In addition, laboratory experiments or manual judgments to validate links are often costly. Fortunately, such costs, unacceptable for blind checking of all possible interactions, can be sharply reduced by focusing on the most likely interactions predicted by efficient computational algorithm on the basis of known (observed) interactions.
Based on observed links and/or attributes of nodes in a network, link prediction attempts to estimate the likelihood of the existence of a link between two nodes. Reliable link prediction algorithms can be used to identify missing links, to predict evolving links that may appear in the future, to filter out spurious links in a noisy network, and more. Similarity-based link prediction is a formalism that associates the likelihood of two disconnected nodes to be connected to their similarity: more similar nodes are supposed to have higher likelihoods of being connected. If nodes’ attributes can be extracted efficiently, link prediction can be cast in the frame of calculating similarities between pairs of nodes in the Euclidean metric space.[12] Without involving nodes’ attributes but with the sole network structure, Fouss et al. found that a node in a network could be expressed as a vector in a Euclidean space that preserves Euclidean commute-time distance, and that the similarity between two nodes was the inner product of nodes’ corresponding vectors.[13] Actually, nodes’ similarities are not necessarily calculated by first mapping nodes into vectors when only the topological structure of a network is available. Topological similarity measures, designed to capture the similarity between two nodes via the sole network structure, give rise to a broad variety of structural similarity based methods for link prediction in complex networks,[14] such as methods based on counting common neighbors.[15–17] Methods not relying on structural similarity include maximal likelihood methods[18–20] by finding the best fit graph(s) (with pre-specified local structures such as blocks, motifs, hierarchy, and so on) to the observed networks and then scoring pairs of nodes by likelihoods, and probabilistic model methods[21–23] that use parameterized probabilistic models to derive the probability of the existence of a link. Although various link prediction methods are continuing to be designed, in this paper we focus on the structural similarity based link prediction methods, for their intuitiveness, driving interest in the community.
These methods can be roughly classified into three major classes:[14] global, local, and intermediate. The global similarity methods use the whole network topological information to compute nodes similarity. For example, the Katz index counts all paths connecting two nodes with more weights on shorter paths.[24] Other global similarity methods include random walk with restart[14] and average commute time index,[13] to name a few. Global methods need information of the whole network structure and output results with higher performance and computational complexity compared with local and intermediate approaches. Local similarity methods use only local structure information to define node similarity. For example, common neighbor (CN) counts the number of common neighbors between two nodes that are more likely to have a link if they share more common neighbors.[15] With different normalizations, the extension of CN includes the Salton[25] and Jaccard index. Building on the mechanism that generates scale-free networks without growth,[26] the preferential attachment index (PA) computes the probability of a link between two nodes proportional to their degrees. Compared to CN and its variants, PA does not require information on the neighbors of each node and thus reduces the computational complexity. With information on neighbors and their degrees, the Adamic–Adar (AA)[16] and resource allocation (RA) indices[17] compute structural similarity by assigning less-connected nodes more weights. AA and RA indices are similar in essence but weight differently large degree nodes. Finally, the intermediate approaches between global and local include local path index[27] and its variants, which are methods based on local random walks[14] and give a good tradeoff between performance and complexity. More details on various state-of-the-art link prediction methods can be found summarized in an excellent survey.[14]
An important feature of the link prediction method lies in their performances being associated to the mechanism(s) underlying networks’ formation.[14,18,19,28–31] If the principle of a link prediction method is consistent to the network formation mechanism, the method performs well providing accurate predictions. In contrast, if a link prediction method cannot capture well the underlying link formation mechanism, it will often not give satisfactory predictions. For example, the hierarchical random graph based link prediction framework with the assumption of hierarchical organization of networks tends to have high prediction accuracy if the observed network organizes hierarchically.[18] Similarly, if the observed network is organized into blocks[32,33] or distinct roles, the framework based on the stochastic block model (SBM)[34] will provide accurate link predictions.[19] Inspired by this fact, Wang et al. viewed link addition in network evolution as a kind of link prediction algorithm and employed likelihood analysis to judge which mechanism is better among a given series of network formation mechanisms for explaining the evolution of a network,[28] while Zhang et al. applied the likelihood analysis and link prediction methods to quantitatively measure how multiple mechanisms contribute to a network formation.[29] In directed network cases, Zhang et al. found that a non-observed link would be more likely to exist if the link generates more potential-definable subgraphs (such as Bi-fan structure consisting of 4 nodes and 4 directed links) deduced by the potential theory with clustering and homophily mechanisms.[30]
When nodes in a network are partitioned into blocks and the probability that two nodes are connected depends only on the blocks they belong to, SBM is well suited for capturing the network organization mechanisms. Actually, SBM is a powerful model for generating a wide variety of networks, which can capture ubiquitous and fundamental properties of real world networks, overarching the more specific models presented above: modular or block structure, core-periphery, hierarchical, multipartite structure, and so on. SBM generates networks using as the link formation mechanism the probability of the existence of a link between two nodes that depends only on the node's block memberships. However, nodes in a group/block are not always homogenous but often heterogeneous, which is a feature found in most empirical network data.[1,3,35] Therefore, we deem reasonable to model the process of real-world network formation by considering nodes’ specific characteristics. According to this observation, we propose to compute the structural similarity between two nodes by considering information on both block structures and node degrees. The method applies a modularity-based belief propagation algorithm[36] to recover the underlying block structure of a network and then calculates the connecting probability distribution over blocks. We name this method the block model preferential attachment index (BMPA). In this paper, we first describe the problem of link prediction in a network and introduce several state-of-the-art similarity based methods as baselines. Then we propose our novel method for link prediction and review the modularity-based belief propagation algorithm for block structure recovering. Testing and comparisons between BMPA and baseline methods on a set of real-world networks and synthetic networks generated by SBM are finally shown.
We focus on link prediction in an undirected and unweighted network G(V, E) with no self-connections nor multiple links, where V is the set of nodes and E the set of links. For missing link prediction, the set of links E is randomly divided into two sets: the training set
To quantify the ability of methods to predict links, we use the area under the receiver operating characteristic curve (AUC).[37] Similarity based link prediction associates node similarity to the likelihood of the link existence. By computing similarities of all pairs in
(1) |
AUC varies generally between 0.5 and 1, with AUC = 0.5 indicating random guesses and AUC = 1 indicating the perfect ability of the method to predict missing links.
Similarly, the AUC value of a method for spurious or evolving link prediction can be calculated by comparing the scores between a randomly picked spurious link in
For a given node pair i and j, link prediction methods compute the similarity score
With
(2) |
The PA method[26] assumes that the probability of a link connecting nodes i and j is proportional to their degrees
(3) |
Compared to CN, PA does not require the information on nodes’ neighborhood, and thus has lower computational complexity.
Different from CN that gives equal weights to each neighbor of a node, the AA method [16] assigns more weights to less-connected neighbors and computes the similarity score as
(4) |
The resource allocation dynamics motivated method, RA,[17] also uses unequal weights when counting common neighbors as AA, but penalizes the high-degree neighbors more heavily. The similarity score computed by RA is
(5) |
Nodes in a network are often organized into groups (or blocks) according to distinct mechanisms. For example, people often establish social relations with others according to their age, gender, professions, ethnicity, and so on. As indicated above, when predicting the existence of links in such kinds of networks, it is crucial to reflect this underlying link formation mechanism in the prediction method. SBM can well capture such a ubiquitous and fundamental property. In SBM, the likelihood or probability that two nodes are connected depends only on the groups they belong to. Therefore, in the frame of SBM, all nodes in one group have a common probability to connect to all nodes in another group. We therefore simply called the probability that two nodes are connected as the probability that the groups they belong to are connected, to reflect such a characteristic of SBM that all nodes in the same group are treated equally. However, in real-world networks, all nodes in a group are not always homogenous. They have their own characteristics or roles, simply and directly characterized by node degree. Thus, to well capture nodes’ similarity or the likelihood of a link existence, we combine these two factors.
Denote
(6) |
When combining the connected probability of node groups with node degree, the new similarity index between nodes i and j, BMPA, is defined as
(7) |
The basic idea of BMPA is similar to that of the degree-corrected block model that takes into account degree heterogeneity to partition networks.[38] However, the connecting probability is modeled in a degree–block model using a Poisson distribution with
To compute node similarity with Eq. (
(8) |
(9) |
Here
In this section, we test BMPA for link prediction on six real-world networks and synthetic networks generated by stochastic block model, and compare it with four high performance state-of-the-art link prediction methods.[14]
Comparisons between BMPA and other state-of-the-art methods are performed on six real-world networks. The six networks are chosen for their usefulness in revealing different performance behaviors among BMPA and the four baseline methods as popular benchmarks in this type of research, thus also allowing an easy comparison with current literature. These networks are: (i) zcNet, a well-known social network of a karate club at an American university observed by Zachary in the early 1970s,[40] with club members as nodes and their social interactions as links; (ii) pblgNet, a network of political blogs whose nodes are blogs about American politics and links the web hyperlinks between them;[41] (iii) dphNet, a social network of bottlenose dolphins living in Doubtful Sound in New Zealand,[42] whose nodes are dolphins and the links between nodes indicate that the dolphin pairs were seen together more often than expected by chance; (iv) sfiNet, a collaboration network of scientists at the Santa Fe Institute indicating coauthorship between these scientists;[32] (v) FWNet, a network of a food web in Florida Bay during the wet season,[43] representing carbon flow between species in the Bay; (vi) Yeast, a protein–protein interaction network in budding yeast[44] whose nodes are proteins and links their interactions. If networks are directed in their original form, their arcs are converted into undirected links. For disconnected networks, their largest connected components are used. Table
The experiments were conducted by evaluating the AUC performance on missing, evolving, and spurious link predictions. We formed the training and testing link sets of a network by randomly removing or adding a fraction of the total links in the network. The fraction of links removed was determined to keep all networks connected and at least half of the total links in the networks were removed if possible, while the maximal fraction of links added was equal to 1.
Figure
As shown in Fig.
The experiments above would suggest that PA is not the best choice for link prediction in networks. To confirm the association between good performance of link prediction and the appropriate link formation model, we further tested two additional networks: FWNet and Yeast. The results are shown in Fig.
Complementarily, we conducted experiments on a series of synthetic networks with determined internal structure, to confirm the expectation that BMPA would perform very well in this case (Section
As a final experiment on the real-world networks, we compared BMPA with a classical likelihood method for link reliability estimation[19] due to the shared SBM model used in the modeling network. Since the link reliability method (LRM) needs to sample a large number of possible partitions, it is a highly time-consuming method. In contrast, BMPA uses the consensus of several good partitions of a network to derive the link distribution over groups, and is thus scalable and efficient. Although it would be desirable to offer the comparison of BMPA with LRM together with other baseline methods in a single figure, the high computational costs of LRM makes it prohibitive to produce results on pblgNet and Yeast in the form adopted in Figs.
Table
Due to the predicting accuracy shown in this section and the high computational complexity of LRM, in the following section, we compared the performance of BMPA only with classical structural similarity based methods.
Synthetic networks in this section were generated by the stochastic block model. Given q groups of nodes with equal size, each pair of nodes i and j is connected with probability
We generate three sets of synthetic networks with
Figure
For missing or spurious link prediction, when more and more links were randomly removed or added, the block structure was gradually destroyed and thus the performance decreased with the fraction of links removed or added. While for evolving link prediction, the block structure was kept unchanged since randomly added links were not included in the training set of links. As a result, BMPA keeps performing at best with the fraction of added links in evolving link prediction. The reason is confirmed in Fig.
Finally, we explored the robustness of BMPA with respect to the original network partition (block detection), as this is inferred and not given information. Block detection is achieved with Modbp[34] chosen for its computational efficiency and statistically significant block finding property, although other approaches could be used. When the real number of blocks in a network is not known a priori, Modbp or its variant are able to automatically compute a narrow range of potential partition around the true value, and outputs the consensus among the significant partitions produced for this narrow range of values. We here show that performances of BMPA are robust to the partition obtained by using Modbp.
Figure
In this paper we proposed a new method for link prediction in networks. It is based on a novel link likelihood computation index, BMPA. BMPA is designed to predict links in stochastic block model networks, ubiquitous and appropriate models of real-world complex networks, and assumes that the likelihood of two nodes being connected depends not only on the characteristics of nodes but also their block membership. Therefore, BMPA adopts an efficient significant block detection algorithm, Modbp, to first obtain the information of the underlying block structure of a network and then derives the probability of link distribution between any two blocks. Combining information of node degree with block connecting probability, BMPA connects the similarity between two nodes to the likelihood of the existence of a link between them. Comparisons with other state-of-the-art link prediction methods on various real-world networks and sets of synthetic networks generated by SBM confirm the good performance of BMPA in missing, evolving, and spurious link prediction in an SBM and assimilated real-world network.
When networks are generated by SBM, BMPA that captures the underlying link formation mechanism identifies the missing, evolving, and spurious links with high accuracy, improving on state-of-the-art SBM prediction methods. Such BMPA complements the toolkit of instruments needed to predict links when networks are appropriately modeled by stochastic block models.
Although in the current paper Modbp is adopted to recover blocks that are statistically significant, making BMPA robust to the fluctuations of block structure, it is not the only method to recover the block structure in a network. As a consequence, various block structure finding methods[46] can be integrated into the link prediction framework proposed here. Exploration of how different block structure finding methods influence prediction results when integrated into the framework proposed here is the object of future work.
[1] | |
[2] | |
[3] | |
[4] | |
[5] | |
[6] | |
[7] | |
[8] | |
[9] | |
[10] | |
[11] | |
[12] | |
[13] | |
[14] | |
[15] | |
[16] | |
[17] | |
[18] | |
[19] | |
[20] | |
[21] | |
[22] | |
[23] | |
[24] | |
[25] | |
[26] | |
[27] | |
[28] | |
[29] | |
[30] | |
[31] | |
[32] | |
[33] | |
[34] | |
[35] | |
[36] | |
[37] | |
[38] | |
[39] | |
[40] | |
[41] | |
[42] | |
[43] | |
[44] | |
[45] | |
[46] |